Challenge 9
Creating Nice Tables
Interactive preview of the dataset
Code
#interactice preview of dataset
datatable(a_names)Warning in instance$preRenderHook(instance): It seems your data is too big
for client-side DataTables. You may consider server-side processing: https://
rstudio.github.io/DT/server.html
Part 1 – Formatting Basic Tables
Use the kable() function from the kable package to create nice, basic HTML tables for every table in this lab.
3) Summarizing & Visualizing the Number of Allisons
- Make a summary table of the number of babies named “Allison” for each state and the sex of the baby. Specifically, each state should be its own row and each sex should have its own column. Additionally, if there were no babies born with the name “Allison” for that combination of year, state, & sex the table should have a value of 0 (not an
NA).
Code
sum_table_a <- a_names |>
pivot_wider(names_from = `Sex at Birth`, values_from = Count) |>
filter(Name == "Allison") |>
group_by(State) |>
summarize(across(.cols = c("F", "M"), .fns = sum)) |>
mutate(across(.cols = c("F", "M"), ~replace_na(., 0)))
kable(head(sum_table_a), booktabs = TRUE) %>%
kableExtra::kable_styling(font_size = 15)| State | F | M |
|---|---|---|
| AK | 232 | 0 |
| AL | 1535 | 0 |
| AR | 1198 | 0 |
| AZ | 1880 | 0 |
| CA | 12413 | 0 |
| CO | 1594 | 0 |
Code
allisons_female <- a_names |>
filter(Name == "Allison",
`Sex at Birth` == "F") |>
group_by(State) |>
summarize("Female Allisons" = sum(Count))
#kable(head(allisons_female))
datatable(allisons_female)You should have seen in the table above that “Allison” is a name given overwhelmingly to babies assigned female at birth. So,
filterthe data to include only babies assignedFemale at birth.Make a visualization showing how the popularity of the name “Allison” has changed over the years.
Code
a_names |>
filter(Name == "Allison",
`Sex at Birth` == "F") |>
ggplot(mapping = aes(x = Year, y = Count)) +
geom_col(fill = "purple") +
labs(x = "",
y = "",
title = "Count of Females named Allison (1997-2014)")4) Modeling the Number of Allisons
- Create a linear model with the year as the explanatory variable, and the number of Allisons as the response.
Code
allisons_count <- a_names %>%
filter(Name == "Allison") %>%
lm(Count ~ Year, data = .)
summary(allisons_count)
Call:
lm(formula = Count ~ Year, data = .)
Residuals:
Min 1Q Median 3Q Max
-130.91 -90.77 -43.53 40.94 859.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3881.8855 1765.2407 2.199 0.0281 *
Year -1.8758 0.8802 -2.131 0.0333 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 138 on 912 degrees of freedom
Multiple R-squared: 0.004955, Adjusted R-squared: 0.003864
F-statistic: 4.542 on 1 and 912 DF, p-value: 0.03335
- Visualize the regression model.
Code
allisons_count |>
ggplot(mapping = aes(x = Year, y = Count)) +
geom_point() +
labs(x = "",
y = "",
title = "Count of Females named Allison (1997-2014)")Write out the estimated regression equation.
y = 3881.89 - 1.88*x + e
Plot the residuals of the model, that is, the actual values minus the predicted values. Comment on the residuals - do you see any patterns?
Code
allisons_count |>
broom::augment() |>
ggplot(mapping = aes(y = .resid, x = .fitted)) +
geom_point()The residuals show a pattern in the plot. They get increasingly more dispersed as the model moves away from zero. This indicates that this may not be the most ideal model for the data.
- What do you conclude from this model?
We can conclude that this model may not be the best representation of the data, because the residual graphs shows a slight pattern. A similar model with perhaps more explanatory variables might be e better fit of the data.
5) Spelling by State
- Narrow the A name dataset (downloaded previously) down to only male-assigned babies named “Allan”, “Alan”, or “Allen”. Make a plot comparing the popularity of these names over time.
Code
allen_data <- a_names |>
filter(Name == "Allan" | Name == "Alan" | Name =="Allen")Code
allen_data |>
ggplot(mapping = aes(x = Name, y = Count, fill = Name)) +
geom_col() +
theme(legend.position = "none") +
labs(x = "",
y = "",
title = "Counts of Allen, Alan, and Allan (1997-2014)")- In California, Allan’s spelling of his name is the least common of the three, but perhaps it’s not such an unusual name for his home state of Pennsylvania. Compute the total number of babies born with each spelling of “Allan” in 2000, in Pennsylvania and in California. Specifically, each spelling should be its own column and each state should have its own row. Similar to before, a
0(not anNA) should be used to represent locations where there were no instances of these names.
Code
allen_names <- allen_data |>
pivot_wider(names_from = Name, values_from = Count) |>
filter(Year == 2000, `Sex at Birth` == "M", State == "CA" | State == "PA") |>
mutate(across(.cols = Allen:Allan, ~replace_na(., 0)))
kable(head(allen_names), booktabs = TRUE) %>%
kableExtra::kable_styling(font_size = 15)| Year | Sex at Birth | State | Allen | Alan | Allan |
|---|---|---|---|---|---|
| 2000 | M | CA | 176 | 579 | 131 |
| 2000 | M | PA | 56 | 51 | 12 |
- Convert your total counts to overall percents. That is, what was the percent breakdown between the three spellings in CA? What about in PA?
Code
allen_p <- allen_data |>
pivot_wider(names_from = Name, values_from = Count) |>
mutate(across(.cols = Allen:Allan, ~replace_na(., 0))) |>
filter(Year == 2000, `Sex at Birth` == "M", State == "CA" | State == "PA") |>
mutate(Allen = round(Allen/ sum(across(Allen:Allan)), 2),
Alan = round(Alan/ sum(across(Allen:Allan)), 2),
Allan = round(Allan/ sum(across(Allen:Allan)), 2))
kable(head(allen_p), booktabs = TRUE) %>%
kableExtra::kable_styling(font_size = 15)| Year | Sex at Birth | State | Allen | Alan | Allan |
|---|---|---|---|---|---|
| 2000 | M | CA | 0.18 | 0.75 | 0.91 |
| 2000 | M | PA | 0.06 | 0.07 | 0.08 |